The basic idea of this analysis is to check what is causing people to visit hospitals and which diagnostic codes affect people the most.

##       X.1                X               Year             NewBorn       
##  Min.   :      1   Min.   :     1   Min.   :2001   New Born   : 177289  
##  1st Qu.: 411542   1st Qu.: 97706   1st Qu.:2002   Not NewBorn:1468876  
##  Median : 823083   Median :180014   Median :2003                        
##  Mean   : 823083   Mean   :179415   Mean   :2003                        
##  3rd Qu.:1234624   3rd Qu.:262322   3rd Qu.:2004                        
##  Max.   :1646165   Max.   :375372   Max.   :2005                        
##                                                                         
##    UnitsofAge         Age            Sex        
##  Min.   :1.000   Min.   : 0.00   Female:971461  
##  1st Qu.:1.000   1st Qu.:24.00   Male  :674704  
##  Median :1.000   Median :48.00                  
##  Mean   :1.248   Mean   :45.77                  
##  3rd Qu.:1.000   3rd Qu.:71.00                  
##  Max.   :3.000   Max.   :99.00                  
##                                                 
##                              Race          MaritalStatus   
##  White                         :864820   Divorced : 34477  
##  Black                         :230778   Married  :237782  
##  Other                         : 71794   Separated:  6187  
##  Asian                         : 11519   Single   :291457  
##  American Indian/Alaskan Native:  5260   Widowed  : 77999  
##  (Other)                       :  2722   NA's     :998263  
##  NA's                          :459272                     
##  DischargeMonth                             DischargeStatus   
##  Min.   : 1.000   Alive, disposition not stated     :  76773  
##  1st Qu.: 3.000   Dead                              :  27043  
##  Median : 6.000   Medical Advice                    :  12079  
##  Mean   : 6.429   Routine                           :1050035  
##  3rd Qu.: 9.000   transferred to long-term facility : 102889  
##  Max.   :12.000   transferred to short-term facility:  35621  
##                   NA's                              : 341725  
##    DaysofCare       LengthofStay      X.GeoLocation   
##  Min.   :  1.000   Min.   :0.0000   MidWest  :472279  
##  1st Qu.:  2.000   1st Qu.:1.0000   NorthEast:361134  
##  Median :  3.000   Median :1.0000   South    :588806  
##  Mean   :  4.746   Mean   :0.9827   West     :223946  
##  3rd Qu.:  5.000   3rd Qu.:1.0000                     
##  Max.   :561.000   Max.   :1.0000                     
##                                                       
##       HospitalType                     Diagnosis.Code.1  
##  Charity    :1359992   Deliver-single liveborn : 159528  
##  Government : 142457   Single lb in-hosp w/o cs: 124939  
##  Proprietary: 143716   Single lb in-hosp w cs  :  46262  
##                        Pneumonia, organism NOS :  43554  
##                        CHF NOS                 :  42278  
##                        (Other)                 :1126583  
##                        NA's                    : 103021  
##                  Diagnosis.Code.2                   Diagnosis.Code.3 
##  Hypertension NOS        :  45332   Hypertension NOS        : 77979  
##  CHF NOS                 :  42764   CHF NOS                 : 32627  
##  Atrial fibrillation     :  34973   DMII wo cmp nt st uncntr: 27214  
##  Chr airway obstruct NEC :  29821   Atrial fibrillation     : 26943  
##  Urin tract infection NOS:  26172   Chr airway obstruct NEC : 23292  
##  (Other)                 :1175380   (Other)                 :971534  
##  NA's                    : 291723   NA's                    :486576  
##                  Diagnosis.Code.4                  Diagnosis.Code.5 
##  Hypertension NOS        : 76975   Hypertension NOS        : 63944  
##  DMII wo cmp nt st uncntr: 29290   DMII wo cmp nt st uncntr: 25731  
##  CHF NOS                 : 19004   Tobacco use disorder    : 17524  
##  Tobacco use disorder    : 18649   Crnry athrscl natve vssl: 17477  
##  Crnry athrscl natve vssl: 18450   Hyperlipidemia NEC/NOS  : 16441  
##  (Other)                 :827810   (Other)                 :690579  
##  NA's                    :655987   NA's                    :814469  
##                  Diagnosis.Code.6                  Diagnosis.Code.7  
##  Hypertension NOS        : 47110   Hypertension NOS        :  33098  
##  DMII wo cmp nt st uncntr: 19328   DMII wo cmp nt st uncntr:  13997  
##  Crnry athrscl natve vssl: 14769   Crnry athrscl natve vssl:  11798  
##  Tobacco use disorder    : 14498   Hyperlipidemia NEC/NOS  :  11638  
##  Hyperlipidemia NEC/NOS  : 14254   Esophageal reflux       :  11293  
##  (Other)                 :557030   (Other)                 : 449030  
##  NA's                    :979176   NA's                    :1115311  
##  ModeofPayment    X.secondpayment     Admissiontype   
##  Min.   : 1.000   Min.   : 1        Elective :323414  
##  1st Qu.: 2.000   1st Qu.: 5        Emergency:586090  
##  Median : 3.000   Median : 6        NewBorn  :177288  
##  Mean   : 5.413   Mean   : 6        Urgent   :297601  
##  3rd Qu.: 6.000   3rd Qu.: 8        NA's     :261772  
##  Max.   :99.000   Max.   :10                          
##                   NA's   :1228352                     
##               SourceofAdmission 
##  Emergency             :592054  
##  Physician referral    :441439  
##  Other                 :194751  
##  Transfer from hospital: 44860  
##  Clinical referral     : 22566  
##  (Other)               : 28650  
##  NA's                  :321845

There were more females than males, and Emergency was the most common admission type.
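The two observations above can be checked directly from the counts printed in the summary. A minimal base R sketch, using only the Sex and Admissiontype counts shown above (the vector names are made up for illustration):

```r
# Counts copied from the summary() output above (NA's left out).
sex_counts <- c(Female = 971461, Male = 674704)
admission_counts <- c(Elective = 323414, Emergency = 586090,
                      NewBorn = 177288, Urgent = 297601)

# Share of each sex among all discharges.
sex_share <- prop.table(sex_counts)
print(round(sex_share, 3))

# Most frequent admission type among the non-missing records.
top_admission <- names(which.max(admission_counts))
print(top_admission)
```

Working with shares rather than raw counts makes the size of the gap between the sexes easier to read.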

Univariate Plots Section

## nhdsdatadf$Sex: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.519   5.000 400.000 
## -------------------------------------------------------- 
## nhdsdatadf$Sex: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   5.072   6.000 561.000

Half of the discharges involved three or fewer days of care (the median is 3 days for both sexes).
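The per-sex summary above has the shape that `by()` prints. A small sketch of how such a summary could be produced, assuming a data frame with `Sex` and `DaysofCare` columns; the toy values below are invented and are not the NHDS data:

```r
# Toy stand-in for the real data frame; the values are invented.
nhds_toy <- data.frame(
  Sex = factor(c("Female", "Female", "Female", "Female",
                 "Male", "Male", "Male", "Male")),
  DaysofCare = c(1, 2, 3, 5, 2, 3, 6, 10)
)

# Min, quartiles, median, mean and max, one block per sex.
print(by(nhds_toy$DaysofCare, nhds_toy$Sex, summary))

# Median days of care per sex, as a named vector.
median_by_sex <- tapply(nhds_toy$DaysofCare, nhds_toy$Sex, median)

# Fraction of discharges with three or fewer days of care.
share_short <- mean(nhds_toy$DaysofCare <= 3)
print(share_short)
```

Taking the mean of a logical vector is a compact way to get the share of stays at or below a threshold.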

Most of the patients were senior citizens. The graph also showed a bimodal distribution, which was due to admissions for the delivery of babies.

Routine was by far the most common discharge status, which this analysis interprets as patients coming to the hospital for routine check-ups.

Interestingly, some patients were less than one year old, yet there were far more non-newborns than newborns.

Whites are the predominant race in this dataset.

Most patients came from the Southern region, followed by the Mid-West and North-East regions of the United States.

Apart from missing values, Single patients were the largest group, followed by Married and Widowed. The Separated category had the fewest patients, which might mean that separated people were healthier or more health-conscious in a preventive way.

The highest number of patients used Medicare, followed by HMO/PPO.

Apart from Medicare, Blue Cross and private insurance covered the most patients.

Most discharges were from Charity hospitals.

Emergency was the most common source of admission, followed by Physician referral.

Setting aside deliveries and newborns, congestive heart failure (CHF) and pneumonia were the most frequent primary diagnoses. This suggests that heart-related problems were a leading cause of hospitalization in the US, at least between 2001 and 2005.

From the secondary diagnosis code we could infer that atrial fibrillation was a likely contributor to congestive heart failure.

Hypertension was the next most common problem, which may be why the Routine discharge status shows such high counts.

Atherosclerosis (plaque forming in the blood vessels) was the next most frequent diagnosis.

Whites make up the largest percentage among the races.

Univariate Analysis

What is the structure of your dataset?

There were 1646165 observations in the dataset with 36 features, of which I omitted 10 for this analysis. This dataset is mainly about the discharge status of US patients for the period 2001-2005. Their diagnosis, race, payment method, etc. were some of the features of interest.

Other Observations:

  • Newborns were fewer than non-newborns.
  • People tend to go to Charity hospitals more.
  • Whites were more prone to diseases than other races.
  • Apart from NA, Single patients were the largest group and Separated the smallest.
  • Mean of DaysofCare
## [1] 4.745727
  • Mean of Age
## [1] 45.76869
  • Median of DaysofCare
## [1] 3
  • Median of Age
## [1] 48
  • Standard Deviation of DaysofCare
## [1] 6.995645
  • Standard Deviation of Age
## [1] 28.37315
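The six figures above are plain column statistics. A minimal sketch of computing them in one pass, assuming numeric `DaysofCare` and `Age` columns; the toy data frame below is invented for illustration:

```r
# Toy stand-in with the two numeric columns; the values are invented.
toy <- data.frame(
  DaysofCare = c(1, 2, 3, 5, 9),
  Age        = c(0, 24, 48, 71, 99)
)

# One row per statistic, one column per feature.
stats <- sapply(toy, function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
})
print(round(stats, 3))
```

Passing `na.rm = TRUE` matters on the real data, where several columns contain missing values.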

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset were Race, Sex, Age, Discharge Status, Diagnosis Code, Source of Admission and Mode of Payment.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Weight and Geolocation were likely to contribute to the discharge status of the patients.

Did you create any new variables from existing variables in the dataset?

No. I have not created any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

We could see a bimodal distribution for the Age feature. I did a lot of data cleaning, such as removing spaces and blanks, replacing diagnosis codes with descriptions based on ICD codes, and converting almost all features to factors.
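The cleaning steps described above could look roughly like the following sketch. The column names, the toy rows and the two-entry lookup table are assumptions for illustration; the actual scripts used for the analysis are not shown in this report.

```r
# Toy raw data with stray spaces and blanks; the values are invented.
raw <- data.frame(
  Sex           = c(" Female", "Male ", "", "Female"),
  DiagnosisCode = c("4280", "486", "4280", " "),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from ICD-9 code to short description.
icd_lookup <- c("4280" = "CHF NOS", "486" = "Pneumonia, organism NOS")

clean <- raw
clean[] <- lapply(clean, trimws)   # remove leading/trailing spaces
clean[clean == ""] <- NA           # treat blanks as missing

# Replace codes with descriptions, then convert to factors.
clean$Diagnosis <- factor(unname(icd_lookup[clean$DiagnosisCode]))
clean$Sex <- factor(clean$Sex)
print(clean)
```

Trimming before the blank check means a cell holding only spaces is also treated as missing.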

Bivariate Plots Section

We wanted to check whether males or females stayed longer in hospitals. The boxplot for males was taller than the one for females, which suggests that males stayed longer. Most females stayed in the hospital for a relatively similar number of days.

Although females were more numerous, males stayed longer. It was interesting to observe that, although men are often considered physically stronger, they took more time to recover than women. We also wanted to see whether the proportion of females was still higher when missing values were included and, as expected, females had a higher proportion of hospital visits and stays.
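One way to compare proportions with and without missing values is sketched below. The toy rows are invented, and `MaritalStatus` is used only as an example of a column with many NA's:

```r
# Toy data: two of the six rows lack a marital status.
toy <- data.frame(
  Sex = c("Female", "Female", "Male", "Female", "Male", "Male"),
  MaritalStatus = c("Married", NA, "Single", NA, "Widowed", "Single"),
  stringsAsFactors = FALSE
)

# Share of each sex over all rows, missing values elsewhere included.
share_all <- prop.table(table(toy$Sex))

# Share of each sex after dropping every row that has an NA.
complete <- na.omit(toy)
share_complete <- prop.table(table(complete$Sex))

# Counting NA as its own category keeps the missing rows visible.
marital_with_na <- table(toy$MaritalStatus, useNA = "ifany")
print(marital_with_na)
```

If the two shares differ noticeably, the missing values are not spread evenly across the groups, and conclusions drawn after `na.omit()` deserve extra care.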

Compared to females, males started visiting hospitals at a relatively younger age. The median for males was slightly higher than for females. We wanted to check the age range of hospital visits by gender while considering NA values, and here males were visiting more than females.

Female discharges are higher between the ages of 20 and 35 for obvious reasons such as pregnancy. After 50, males and females visited hospitals in equal numbers. However, at later ages female patients again increased in number, which suggests that females are more prone to various ailments.

We could observe that most diseases started in the late 30s. Many of these points are outliers that significantly change how we interpret the data; still, we can't deny that people fall ill early. Interestingly, we could see pregnant women as old as 47.

We could see the trend that the number of days of stay increases as age advances. The expectation was that days of care should increase with age, and the chart showed the same.

Apart from routine discharges, more females than males were transferred to a long-term facility.

##         
##           Alive   Dead Medical Advice Routine long term stay
##   Female  43728  13838           4872  623974          64676
##   Male    33045  13205           7207  426061          38213
##         
##          short term stay
##   Female           18426
##   Male             17195

Comparatively, Physician referral has more females than males.

##                                         
##                                          Female   Male
##   Clinical referral                       14733   7833
##   Court/law enforcement                    1167   1571
##   Emergency                              322776 269278
##   HMO referral                             4131   1468
##   Other                                   97577  97174
##   Physician referral                     299991 141448
##   Transfer from hospital                  22737  22123
##   Transfer from other health facility      6562   5916
##   Transfer from skilled nursing facility   4794   3041
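Because females outnumber males overall, row shares are easier to compare than raw counts. A short sketch using three rows copied from the table above (the matrix name is made up for illustration):

```r
# Counts copied from the Source of Admission by Sex table above.
source_by_sex <- rbind(
  "Clinical referral"  = c(Female = 14733,  Male = 7833),
  "Emergency"          = c(Female = 322776, Male = 269278),
  "Physician referral" = c(Female = 299991, Male = 141448)
)

# Share of females and males within each source (rows sum to 1).
row_share <- prop.table(source_by_sex, margin = 1)
print(round(row_share, 3))
```

With `margin = 1` each row is divided by its own total, so the Female column can be read as the female share of that admission source.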

Most of the Emergency cases were recorded between ages 60 and 75.

##         
##          MidWest NorthEast  South   West
##   Female  277582    207066 351787 135026
##   Male    194697    154068 237019  88920

People in the Southern region were the most prone to disease, while the West region, with the fewest discharges, appears to have the healthiest people.

More patients visited hospitals in the Southern region than in other regions. The question behind this chart was which region is more prone to disease, or which region's people fall ill most often.

Most people used Medicare for payment, followed by HMO/PPO and private insurance.

The bimodal distribution for females came from pregnancy and regular check-ups. We could also see men visiting regularly from as early as age 35.

Black men appeared healthier than both White men and Black women.

The percentage of Whites is higher than that of any other race. The chart was meant to show the percentage of each race by source of admission. Among the sources of admission, Emergency and Physician referral were more common than any other.

##                                         
##                                          MidWest NorthEast  South   West
##   American Indian/Alaskan Native             200      1189   2405   1466
##   Asian                                      779      1692   2552   6496
##   Black                                    51282     50878 114466  14152
##   Multiple race                               14       111    108    128
##   Native Hawaiian or other Pacific Isldr     101       163    507   1590
##   Other                                     4351     22201  34684  10558
##   White                                   155328    245473 347152 116867

Whites in the Southern region fell ill more often than any other race in any region. The chart was drawn to understand the area-wise percentage of different races.

Comparatively, Whites were more prone to ill health than any other race. We have omitted the other races as they were relatively negligible. The expectation was that the percentage of Whites should be higher than that of other races, to support previous claims.

The expectation from this chart was that the percentage of females should be higher than that of males, to support previous claims.

Admissions were higher in Charity hospitals than in any other hospital type. We wanted to see which type of hospital was serving more people, given that most people pay through Medicare. So the question now is: are the charity hospitals really doing charity?

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The dataset I used was limited to five years of data. Interestingly, there was a relationship between race and location. Whites were highly prone to heart attacks, as that was the top diagnosis code for which Whites saw a doctor. We could also observe from the dataset that Whites saw a doctor regularly from as early as age 35.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

A lot of charity activity was going on in the Southern region compared to other regions. It could be inferred that there was a large number of charity hospitals in the Mid-West and the South. Although the Mid-West has slightly more hospitals than the South, the South recorded more charity discharges than the Mid-West. This suggests the Southern region was either underdeveloped or very rich, as many people there might have started charity organizations to support their fellow citizens. However, we could also see a very high number of proprietary hospitals, which confirms the existence of an affluent class in the South.

What was the strongest relationship you found?

The strongest relationship I found was Race vs. Location. This might partly be because the White population is larger in the region. However, given that Whites were almost double the population of Blacks, the proportion of hospital visits did not match the population. So Whites appeared more prone to disease than Blacks.
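The comparison behind this answer can be sketched from the Race by Location table shown earlier. Only the White and Black rows are used here, and the object names are made up for illustration. Comparing against population figures would need census data, which is not part of this dataset.

```r
# Counts copied from the Race by Location table (White and Black rows only).
race_by_region <- rbind(
  Black = c(MidWest = 51282,  NorthEast = 50878,  South = 114466, West = 14152),
  White = c(MidWest = 155328, NorthEast = 245473, South = 347152, West = 116867)
)

# Share of each race within a region (columns sum to 1).
within_region <- prop.table(race_by_region, margin = 2)
print(round(within_region, 3))

# White-to-Black discharge ratio per region.
ratio <- race_by_region["White", ] / race_by_region["Black", ]
print(round(ratio, 2))
```

A ratio well above the corresponding population ratio for a region would support the claim; a ratio close to it would suggest the counts mostly reflect who lives there.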

Multivariate Plots Section

From the heat map we could see that Deliver-single liveborn diagnoses started at the very beginning of the teenage years. Congestive Heart Failure (CHF) was observed at the relatively young age of around 35.

## [1] 11
## [1] 58

The minimum age of females admitted for delivery is 11 and the maximum is 58.
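A sketch of how such a minimum and maximum could be obtained. The toy rows are invented, and the filter on `UnitsofAge == 1` assumes that code 1 means the age is recorded in years, which this report does not state:

```r
# Toy stand-in for the cleaned data; the values are invented.
toy <- data.frame(
  Sex        = c("Female", "Female", "Female", "Male", "Female"),
  Age        = c(11, 29, 58, 64, 3),
  UnitsofAge = c(1, 1, 1, 1, 3),  # assumption: 1 = years
  Diagnosis  = c("Deliver-single liveborn", "Deliver-single liveborn",
                 "Deliver-single liveborn", "CHF NOS",
                 "Single lb in-hosp w/o cs"),
  stringsAsFactors = FALSE
)

# Keep female delivery records whose age is expressed in years.
is_delivery <- toy$Sex == "Female" &
  toy$UnitsofAge == 1 &
  toy$Diagnosis == "Deliver-single liveborn"

# range() returns the minimum and maximum in one call.
delivery_age_range <- range(toy$Age[is_delivery])
print(delivery_age_range)
```

Filtering on the age unit keeps records measured in months or days from being mistaken for ages in years.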

## Crnry athrscl natve vssl Single lb in-hosp w/o cs Obs chr bronc w(ac) exac 
##                      202                      235                      363 
## Subendo infarct, initial                  CHF NOS  Pneumonia, organism NOS 
##                      900                     1208                     1559

These are the top diagnostic codes for patients discharged as Dead. Patients with these diagnostic codes were mostly admitted under the Emergency category.

We could see that infant boys died more often than infant girls.
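A ranking like the one above can be built by counting diagnoses within the Dead subset and sorting the counts. A minimal sketch on invented toy rows (the column names follow the summary at the top of this report):

```r
# Toy stand-in for the cleaned data; the values are invented.
toy <- data.frame(
  DischargeStatus = c("Dead", "Dead", "Dead", "Routine",
                      "Dead", "Dead", "Dead", "Routine"),
  Diagnosis = c("Pneumonia, organism NOS", "CHF NOS",
                "Pneumonia, organism NOS", "Hypertension NOS",
                "Pneumonia, organism NOS", "CHF NOS",
                "Subendo infarct, initial", "CHF NOS"),
  stringsAsFactors = FALSE
)

# Keep only the records discharged as Dead.
dead <- toy[toy$DischargeStatus == "Dead", ]

# Count each diagnosis and sort from most to least frequent.
dead_counts <- sort(table(dead$Diagnosis), decreasing = TRUE)

# Keep at most the five most frequent diagnoses.
top_dead <- head(dead_counts, 5)
print(top_dead)
```

The output above lists counts in ascending order; sorting with `decreasing = TRUE` simply puts the most frequent diagnosis first.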

## [1] 5
## [1] 76
## [1] 9.126502
## [1] 69.99652
## [1] 20.6802
## [1] 14.17939
##      Atrial fibrillation Subendo infarct, initial Obs chr bronc w(ac) exac 
##                     9679                    10165                    11727 
## Crnry athrscl natve vssl                  CHF NOS  Pneumonia, organism NOS 
##                    12847                    26166                    26400

Whites were using Proprietary hospitals even though Charity and Government hospitals were available. At the same time, Whites were the group that visited hospitals most. Though there were more charity hospitals in the Southern region, people used charity hospitals more in the North-East than in the South. This might be due to a more affluent class in the South compared to other regions.

The range for men was wider than for women, so men might be suffering from a wider variety of ailments. Also, except in the Mid-West region, men were prone to diseases from childhood.

Men were more prone to ailments at an early age than women. We could also observe a difference of nearly 10 years between the life spans of men and women.

We could see a dip in Medical Advice discharges for people in the North-East, from which we could infer that they visit hospitals at later stages of disease. People from the North-Eastern region were also discharged as dead more than those from any other region. Either the facilities were not there, or people visited hospitals at stages where the ailment was no longer curable even after transfer to a long-term facility; the graph shows that people from this region were transferred to long-term facilities more often.

##                  
##                   MidWest NorthEast  South   West
##   Alive             19975     15978  32448   8372
##   Dead               7581      6297   9883   3282
##   Medical Advice     2504      4029   4257   1289
##   Routine          293772    217138 387703 151422
##   long term stay    34756     26954  29587  11592
##   short term stay   10317      8095  13131   4078

However, the statistics tell a different story, so missing values were playing an important role in this particular analysis. We could infer much more accurately if we checked data for a few more years before coming to any conclusion.

This chart confirmed the claim that Routine is the most common discharge status. Deliveries of boys were also slightly more numerous than those of girls.
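Since the regions differ in size, the table above is easier to compare as within-region shares. A sketch using the counts exactly as printed (records with a missing discharge status are not in that table, so they are excluded here too):

```r
# Counts copied from the Discharge Status by Location table above.
status_by_region <- rbind(
  "Alive"           = c(MidWest = 19975,  NorthEast = 15978,  South = 32448,  West = 8372),
  "Dead"            = c(MidWest = 7581,   NorthEast = 6297,   South = 9883,   West = 3282),
  "Medical Advice"  = c(MidWest = 2504,   NorthEast = 4029,   South = 4257,   West = 1289),
  "Routine"         = c(MidWest = 293772, NorthEast = 217138, South = 387703, West = 151422),
  "long term stay"  = c(MidWest = 34756,  NorthEast = 26954,  South = 29587,  West = 11592),
  "short term stay" = c(MidWest = 10317,  NorthEast = 8095,   South = 13131,  West = 4078)
)

# Share of each discharge status within a region (columns sum to 1).
col_share <- prop.table(status_by_region, margin = 2)
print(round(100 * col_share, 2))

# Region with the largest share of Dead discharges.
highest_dead_region <- names(which.max(col_share["Dead", ]))
print(highest_dead_region)
```

On these counts the South has the most Dead discharges in absolute terms, while the NorthEast has the highest Dead share (about 2.3% of its discharges with a known status), so counts and shares can point in different directions.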

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Women were healthier than men, and White people were more prone to ailments than any other race. It could also be inferred that the Southern region has more hospitals than the other regions.

Were there any interesting or surprising interactions between features?

The Southern region also has more proprietary hospitals than other regions, which signals a larger affluent class there. So we could correlate more money with more ailments.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Women lived longer than men. Applying a log scale to the y-axis showed even more detail. Women discharged from hospitals also outnumbered men. Though there are more women in the given dataset, the recorded data for men showed the health trend among men across all US regions. We could also observe that baby boys were more prone to death than baby girls.

Plot Two

Description Two

Most people of advanced age were sent to a long-term facility. This means there were very good facilities available that could treat deadly diseases. The median for the Alive discharge status is around 60, which is good, and the median for Dead is around 65. Routine check-ups started as early as the late 30s (probably for diabetes).

##                                           Alive Dead Medical Advice Routine long term stay short term stay
##   American Indian/Alaskan Native          61.0 65.0             43      32             70            63.5
##   Asian                                   68.0 77.0             50      34             78            70.0
##   Black                                   60.0 67.0             43      38             74            57.0
##   Multiple race                           61.5 77.5             63      38             78            57.0
##   Native Hawaiian or other Pacific Isldr  59.0 73.0             46      33             71            44.0
##   Other                                   57.0 65.0             41      26             74            50.0
##   White                                   72.0 77.0             44      43             80            67.0

Plot Three

Description Three

Crnry athrscl natve vssl (coronary atherosclerosis of a native vessel) was more common among men. Pneumonia was also among the top diseases and affected men more than women. As this is a pollution-related ailment, we could infer that there were more working men than women, because working men were more exposed to pollution outside. Another interesting finding is that more women than men were discharged with a Rehabilitation procedure.


Reflection

The NHDS data set contains 1646165 observations from 2001 to 2005. I started by cleaning the data and converting the features to factors. All NA values were omitted from the analysis. I could see that the relationships between variables differ when the dataset with missing values is compared to the dataset without them. As this is survey data, the outliers were assumed to be natural. I also mapped all the diagnostic codes to appropriate short descriptions using scripts.

I explored each variable in the dataset and then explored interesting questions, such as which gender was more likely to see a doctor apart from regular deliveries for women. Different types of charts show different perspectives, and it was hard to pick the ones that best show the relations between the features. I consulted a coach from Udacity and learned from the discussion that I had picked a really complex data set. So I followed the coach's suggestions, such as trying to identify the correlation between diagnostic codes, discharge status and sex. I could see a strong relation between race and diagnosis code, and eventually explored deeper to look for any demographic connection to the diagnosis code.

The other thing I observed is that the population mostly consists of Whites and Blacks, and that the proportion of ailments differed greatly between the two. I also googled the regions of the US to cross-check my findings against their population and socio-economic conditions. Blacks accounted for a smaller share of ailments than their population would suggest, whereas Whites accounted for a much higher share. Diabetes, for which the US is notorious, is not among the top 5 diagnoses; it might be covered under routine check-ups, which is consistent with Routine having the highest count. A more recent data set would have been better for reaching a solid conclusion.

The main idea for future work is to take the data for Whites alone (as they are the largest group in the dataset), add more years of data, and find out what is really affecting them.